Record: Order-16 Frozen N-gram Oracle + Learned Gate + TTT — val_bpb 0.0274 (3-seed mean) #945

Closed
TimPietrusky wants to merge 1 commit into openai:main from TimPietrusky:submit/order16-frozen-oracle

@TimPietrusky

Record Summary

val_bpb: 0.02742 (3-seed mean, std 0.00003) | 8xH100 SXM | eval <=400s

3-Seed Results

| Seed | val_bpb |
|------|---------|
| 1337 | 0.02744 |
| 42   | 0.02739 |
| 2025 | 0.02744 |
| Mean | 0.02742 |
| Std  | 0.00003 |
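
The summary statistics can be rechecked from the three per-seed values. This is a small sketch using only the numbers in the table; the use of the sample standard deviation (n - 1 denominator) is an assumption, chosen because it is the variant consistent with the reported 0.00003 (the population variant would round to 0.00002).

```python
import statistics

# Per-seed val_bpb values copied from the results table.
val_bpb = {1337: 0.02744, 42: 0.02739, 2025: 0.02744}
values = list(val_bpb.values())

mean = statistics.mean(values)
sample_std = statistics.stdev(values)       # n - 1 denominator (assumed)
population_std = statistics.pstdev(values)  # n denominator, for comparison

print(f"mean={mean:.5f} sample_std={sample_std:.5f} population_std={population_std:.5f}")
# prints "mean=0.02742 sample_std=0.00003 population_std=0.00002"
```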

Method

1. Order-16 Frozen N-gram Oracle

The oracle is pre-filled from all training shards at startup and then frozen. It uses 4M hash buckets and covers orders 2-16 with backoff. It provides per-order n-gram probabilities that are blended with the neural predictions.
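
A minimal sketch of a hashed, frozen n-gram table with backoff, to illustrate the idea. It is not the code from train_gpt.py: the class name `FrozenNgramOracle`, its methods, the use of Python's built-in `hash`, and the "highest order with a seen context wins" backoff rule are all assumptions made for illustration.

```python
NUM_BUCKETS = 1 << 22            # about 4M buckets, as in the description
MIN_ORDER, MAX_ORDER = 2, 16     # n-gram orders covered

class FrozenNgramOracle:
    """Hypothetical hashed count table; collisions are not resolved."""

    def __init__(self, num_buckets=NUM_BUCKETS):
        self.num_buckets = num_buckets
        self.ctx_counts = {}     # bucket -> occurrences of a context
        self.pair_counts = {}    # bucket -> occurrences of (context, next token)
        self.frozen = False

    def _bucket(self, *key):
        # Tuples of ints hash deterministically in CPython.
        return hash(key) % self.num_buckets

    def fill(self, tokens):
        if self.frozen:
            raise RuntimeError("oracle is frozen")
        for i in range(len(tokens)):
            for order in range(MIN_ORDER, MAX_ORDER + 1):
                n_ctx = order - 1
                if i < n_ctx:
                    break        # higher orders need even more history
                ctx = tuple(tokens[i - n_ctx:i])
                b_ctx = self._bucket(order, ctx)
                b_pair = self._bucket(order, ctx, tokens[i])
                self.ctx_counts[b_ctx] = self.ctx_counts.get(b_ctx, 0) + 1
                self.pair_counts[b_pair] = self.pair_counts.get(b_pair, 0) + 1

    def freeze(self):
        self.frozen = True

    def prob(self, history, token):
        """Return (order, estimate) from the highest order whose context was seen."""
        for order in range(MAX_ORDER, MIN_ORDER - 1, -1):
            n_ctx = order - 1
            if len(history) < n_ctx:
                continue
            ctx = tuple(history[-n_ctx:])
            seen = self.ctx_counts.get(self._bucket(order, ctx), 0)
            if seen > 0:
                hits = self.pair_counts.get(self._bucket(order, ctx, token), 0)
                return order, hits / seen
        return None, 0.0

oracle = FrozenNgramOracle()
oracle.fill([1, 2, 3, 1, 2, 3, 1, 2, 4])
oracle.freeze()
order, p = oracle.prob([1, 2], 3)   # context (1, 2) is followed by 3 in 2 of 3 cases
```

Because distinct n-grams can share a bucket, the per-token estimates from a table like this are not guaranteed to sum to one over the vocabulary.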

2. Learned Multi-Expert Gate

An nn.Linear(512, 17) head (1 neural + 16 n-gram order experts) trained end-to-end with a mixer loss (mixer_loss_weight=0.15). It predicts per-token, per-order blending weights via softmax. The neural expert gets a 5% floor.
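
A framework-free sketch of the blending step, assuming the 17 gate logits for one token are already available. How the 5% floor is applied is not stated above; the version here (scale the softmax by 0.95, then add 0.05 to the neural slot) is one possible choice, and the function names are hypothetical.

```python
import math

NEURAL_FLOOR = 0.05  # minimum weight for the neural expert (index 0)

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(logits, floor=NEURAL_FLOOR):
    """Softmax weights with a guaranteed minimum on expert 0 (assumed scheme)."""
    w = [(1.0 - floor) * x for x in softmax(logits)]
    w[0] += floor                         # weights still sum to 1
    return w

def blend(weights, expert_probs):
    """Mix each expert's probability for the same token."""
    return sum(w * p for w, p in zip(weights, expert_probs))

# Even when the gate strongly disfavors the neural expert, it keeps >= 5%.
logits = [-10.0] + [0.0] * 16
weights = gate_weights(logits)
mixed = blend(weights, [0.2] * 17)
```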

3. Complementary Training

Reduces CE loss weight for tokens well-predicted by the oracle (complement_alpha=0.5, complement_threshold=0.3). Forces the neural model to specialize on tokens the n-gram cache can't predict.
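
A sketch of how such a reweighting could look. The exact rule is not given above, so this assumes a hard threshold: tokens whose true-token oracle probability is at least complement_threshold get weight 1 - complement_alpha, and all others keep weight 1. The function names are hypothetical.

```python
import math

def complement_weight(oracle_prob, alpha=0.5, threshold=0.3):
    """Down-weight tokens the oracle already predicts well (assumed rule)."""
    return 1.0 - alpha if oracle_prob >= threshold else 1.0

def weighted_ce(neural_probs, oracle_probs, alpha=0.5, threshold=0.3):
    """Mean weighted cross-entropy (in nats) over the true tokens.

    neural_probs[i]: neural probability assigned to the true token i.
    oracle_probs[i]: oracle probability assigned to the same token.
    """
    total = 0.0
    for p_neural, p_oracle in zip(neural_probs, oracle_probs):
        total += complement_weight(p_oracle, alpha, threshold) * -math.log(p_neural)
    return total / len(neural_probs)

# First token is easy for the oracle (0.9), second is not (0.1).
loss = weighted_ce([0.5, 0.5], [0.9, 0.1])
```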

4. Score-First TTT

1 epoch AdamW (lr=0.001) on all blocks with adaptive temperature ([0.9, 1.05]) and byte-weighted loss. Unfreezes alpha_head, norms, scales, lm_head during TTT.
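
The ordering that "score-first" refers to can be sketched with a toy model: each chunk is scored with parameters that have not yet seen it, and only afterwards used for adaptation. The count-based `ToyUnigram` below is a stand-in for the neural model and its AdamW step, not the submission's method; `bytes_per_token` and the function names are assumptions for illustration.

```python
import math

class ToyUnigram:
    """Stand-in model: Laplace-smoothed unigram counts instead of a network."""

    def __init__(self, vocab_size):
        self.counts = [1] * vocab_size

    def nll_bits(self, token):
        return -math.log2(self.counts[token] / sum(self.counts))

    def update(self, tokens):
        # Replaces the gradient step; adapts only on already-scored tokens.
        for t in tokens:
            self.counts[t] += 1

def score_first_eval(model, chunks, bytes_per_token):
    """Return bits per byte, scoring every chunk before training on it."""
    total_bits, total_bytes = 0.0, 0
    for chunk in chunks:
        # 1) Score with the current parameters.
        total_bits += sum(model.nll_bits(t) for t in chunk)
        total_bytes += sum(bytes_per_token[t] for t in chunk)
        # 2) Only then adapt on the chunk that was just scored.
        model.update(chunk)
    return total_bits / total_bytes

bpb = score_first_eval(ToyUnigram(2), [[0, 0], [0, 0]], bytes_per_token=[1, 1])
```

In this toy run the first chunk costs 1 bit per token, and the second chunk is cheaper only because the model adapted on the first one after scoring it.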

5. Model Architecture

  • 11 layers, 512 dim, 8 heads, 8 KV heads
  • MLP 3.5x with LeakyReLU(0.5)²
  • XSA-all, partial RoPE (16 dims), VE(128) on layers 9-10
  • BigramHash (6144 vocab, 128 dim)
  • EMA(0.997), SWA every 50 steps, warmdown=3500
  • Int5 + zstd quantization with 3% pruning

Submission Checklist

  • One new folder under records/track_10min_16mb/
  • Included README.md, submission.json, train_gpt.py
  • 3 train logs (seeds 1337, 42, 2025)
  • Eval <= 600s on 8xH100 (measured ~400s)
  • Score-first evaluation maintained
  • N-gram oracle uses training data (legal status under review per RFC: How to Clean Up All the Parameter Golf Submissions #886)

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 27, 2026
- Update merged SOTA to 1.1194 (abaybektursun, was 1.1228 signalrush)
- Add competition strategy pivot: n-gram eval cache now dominates (~0.02-0.97 bpb)
- Document PR openai#727 (0.9674), openai#741 (0.9850), openai#945 (0.0274), openai#961 (0.0881) findings
- Add Lessons Learned entries 17-20 on n-gram dominance + memorization risk
- Update Technique Reference table with n-gram entries

https://claude.ai/code/session_01Bpr2fKEnkNQmNKno8EnxWF
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 27, 2026
Merge remote's two-pass n-gram discoveries (PR openai#868 0.1181, PR openai#870 0.0935)
with today's extreme n-gram findings (PR openai#945 0.0274, PR openai#961 0.0881).
Keep Architecture Decisions and Legal TTT Protocol from remote.
Add Lessons Learned 17-20 from 2026-03-27 research.

https://claude.ai/code/session_01Bpr2fKEnkNQmNKno8EnxWF
@valerio-oai
Contributor

Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not correctly renormalize or reweight the LM's token distribution, and which look ahead to the target token to mix probabilities and therefore leak eval tokens. Please refer to the long discussion about this under the issues tab for more details, and please submit more runs in the future!
